#direct preference
Explore tagged Tumblr posts
canisalbus · 10 months ago
Text
✦ Siesta ✦
16K notes · View notes
quazies · 1 year ago
Text
Putting things back together..
8K notes · View notes
marypickfords · 1 year ago
Photo
Gentlemen Prefer Nature Girls (Doris Wishman, 1963)
4K notes · View notes
winepresswrath · 7 months ago
Text
I am still obsessed with how fucking rude Armand's little script notes were. "We need an animation here to convey how bad the hoarding was," bitch that is your boyfriend you are selling to his death. It's that necessary for you to let a whole theatre of humans know how bad his depression cave got?
439 notes · View notes
crocsfroggo · 1 month ago
Text
study..? w flirty doods
222 notes · View notes
lully-jo · 5 months ago
Text
I wonder how many people really love and relate to Alisaie because they too were the girl who struggled to understand how folks around her could act like suffering and selfishness and hivemind behavior was normal and their softness was called a weakness and so they built up an abrasive exterior to force people to take them seriously and now they don't know how to be vulnerable without feeling embarrassed or cringe even though they still feel things so deeply it's almost suffocating...
275 notes · View notes
cptnbeefheart · 6 months ago
Text
last year in my sketchbook: can representational art be more sustainable for me if i take it less seriously? can i balance the two? | my art
246 notes · View notes
jcmarchi · 5 months ago
Text
LongWriter: Unleashing 10,000+ Word Generation from Long Context LLMs
New Post has been published on https://thedigitalinsider.com/longwriter-unleashing-10000-word-generation-from-long-context-llms/
Current long-context large language models (LLMs) can process inputs up to 100,000 tokens, yet they struggle to generate outputs exceeding even a modest length of 2,000 words. Controlled experiments reveal that the model’s effective generation length is inherently limited by the examples seen during supervised fine-tuning (SFT). In other words, this output limitation stems from the scarcity of long-output examples in existing SFT datasets.
Recent advancements in long-context LLMs have led to the development of models with significantly expanded memory capacities, capable of processing history exceeding 100,000 tokens in length. However, despite their ability to handle extensive inputs, current long-context LLMs struggle to generate equally lengthy outputs.
To explore this limitation, LongWriter probes the maximum output length of state-of-the-art long-context models with multiple queries that require responses of varying lengths, such as “Write a 10,000-word article on the history of the Roman Empire.” The results show that all models consistently fail to produce outputs beyond 2,000 words in length. Meanwhile, analysis of user interaction logs reveals that over 1% of user prompts explicitly request outputs exceeding this limit, highlighting a pressing need in current research to overcome this limitation.
To address this, LongWriter introduces AgentWrite, an agent-based pipeline that decomposes ultra-long generation tasks into subtasks, enabling off-the-shelf LLMs to generate coherent outputs exceeding 20,000 words. Leveraging AgentWrite, LongWriter constructs LongWriter-6k, a dataset containing 6,000 SFT data samples with output lengths ranging from 2k to 32k words. By incorporating this dataset into model training, LongWriter successfully scales the output length of existing models to over 10,000 words while maintaining output quality.
LongWriter also develops LongBench-Write, a comprehensive benchmark for evaluating ultra-long generation capabilities. The 9B parameter model, further improved through DPO, achieves state-of-the-art performance on this benchmark, surpassing even much larger proprietary models.
In this article, we will discuss the LongWriter framework, explore its architecture, and compare its performance against state-of-the-art long-context large language models. Let’s get started.
Recent advancements in long context large language models (LLMs) have led to the creation of models with significantly increased memory capacities, capable of processing histories that exceed 100,000 tokens. Despite this ability to handle extensive inputs, current long-context LLMs struggle to generate outputs of comparable length. To investigate this limitation, LongWriter examines the maximum output length of state-of-the-art long-context models through various queries that require different response lengths, such as “Write a 10,000-word article on the history of the Roman Empire.” Based on the findings, LongWriter observes that all models consistently fail to generate outputs longer than 2,000 words. Furthermore, an analysis of user interaction logs indicates that over 1% of user prompts specifically request outputs beyond this limit, highlighting an urgent need in current research to address this issue. 
LongWriter’s study reveals a key insight: the constraint on output length is primarily rooted in the characteristics of the Supervised Fine-Tuning (SFT) datasets. Specifically, LongWriter finds that a model’s maximum generation length is effectively capped by the upper limit of output lengths present in its SFT dataset, despite its exposure to much longer sequences during the pretraining phase. This finding explains the ubiquitous 2,000-word generation limit across current models, as existing SFT datasets rarely contain examples exceeding this length. Furthermore, as many datasets are distilled from state-of-the-art LLMs, they also inherit the output length limitation from their source models.
To address this limitation, LongWriter introduces AgentWrite, a novel agent-based pipeline designed to leverage off-the-shelf LLMs to automatically construct extended, coherent outputs. AgentWrite operates in two stages: First, it crafts a detailed writing plan outlining the structure and target word count for each paragraph based on the user’s input. Then, following this plan, it prompts the model to generate content for each paragraph in a sequential manner. LongWriter’s experiments validate that AgentWrite can produce high-quality and coherent outputs of up to 20,000 words.
Building upon the AgentWrite pipeline, LongWriter leverages GPT-4o to generate 6,000 long-output SFT samples, named LongWriter-6k, and adds this data to the training of existing models. Notably, LongWriter-6k successfully unlocks the model’s ability to generate well-structured outputs exceeding 10,000 words in length. To rigorously evaluate the effectiveness of this approach, LongWriter develops the LongBench-Write benchmark, which contains a diverse set of user writing instructions with output length specifications in four ranges: 0-500 words, 500-2,000 words, 2,000-4,000 words, and beyond 4,000 words. Evaluation on LongBench-Write shows that LongWriter’s 9B model achieves state-of-the-art performance, even compared to larger proprietary models. LongWriter further constructs preference data and uses DPO to help the model better follow long writing instructions and generate higher-quality written content, which experiments also show to be effective.
To summarize, LongWriter’s work makes the following novel contributions:
Analysis of Generation Length Limits: LongWriter identifies the primary factor limiting the output length of current long-context LLMs, which is the constraint on the output length in the SFT data.
AgentWrite: To overcome this limitation, LongWriter proposes AgentWrite, which uses a divide-and-conquer approach with off-the-shelf LLMs to automatically construct SFT data with ultra-long outputs. Using this method, LongWriter constructs the LongWriter-6k dataset.
Scaling Output Window Size of Current LLMs: LongWriter incorporates the LongWriter-6k dataset into its SFT data, successfully scaling the output window size of existing models to 10,000+ words without compromising output quality. LongWriter shows that DPO further enhances the model’s long-text writing capabilities.
AgentWrite: Automatic Data Construction
To utilize off-the-shelf LLMs for automatically generating SFT data with longer outputs, LongWriter designs AgentWrite, a divide-and-conquer style agent pipeline. AgentWrite first breaks down long writing tasks into multiple subtasks, with each subtask requiring the model to write only one paragraph. The model then executes these subtasks sequentially, and LongWriter concatenates the subtask outputs to obtain the final long output. Such an approach of breaking down a complex task into multiple subtasks using LLM agents has already been applied in various fields, such as problem-solving, software development, and model evaluation. LongWriter’s work is the first to explore integrating planning to enable models to complete complex long-form writing tasks. Each step of AgentWrite is introduced in detail below.
Step I: Plan
Inspired by the thought process of human writers, who typically start by making an overall plan for long writing tasks, LongWriter utilizes the planning capabilities of LLMs to output such a writing outline given a writing instruction. This plan includes the main content and word count requirements for each paragraph. The prompt used by LongWriter is as follows:
“I need you to help me break down the following long-form writing instruction into multiple subtasks. Each subtask will guide the writing of one paragraph in the essay and should include the main points and word count requirements for that paragraph. The writing instruction is as follows: User Instruction. Please break it down in the following format, with each subtask taking up one line:
Paragraph 1 – Main Point: [Describe the main point of the paragraph, in detail] – Word Count: [Word count requirement, e.g., 400 words]
Paragraph 2 – Main Point: [Describe the main point of the paragraph, in detail] – Word Count: [Word count requirement, e.g., 1000 words]
Make sure that each subtask is clear and specific, and that all subtasks cover the entire content of the writing instruction. Do not split the subtasks too finely; each subtask’s paragraph should be no less than 200 words and no more than 1000 words. Do not output any other content.”
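The plan format requested by this prompt (one subtask per line, with a main point and a word count) lends itself to simple parsing. The following is a minimal Python sketch, not LongWriter's actual code: the `Subtask` type, the `parse_plan` helper, the regular expression, and the sample plan text are all hypothetical and only illustrate how such a plan could be read back into structured subtasks.

```python
import re
from typing import NamedTuple


class Subtask(NamedTuple):
    """One line of the writing plan (hypothetical structure)."""
    index: int
    main_point: str
    word_count: int


# Matches lines such as:
# "Paragraph 1 – Main Point: ... – Word Count: 400 words"
# Both an en dash and a plain hyphen are accepted as separators.
_PLAN_LINE = re.compile(
    r"^Paragraph\s+(\d+)\s*[–-]\s*Main Point:\s*(.*?)\s*[–-]\s*Word Count:\s*(\d+)"
)


def parse_plan(plan_text: str) -> list[Subtask]:
    """Parse the model's plan into subtasks, skipping lines that do not match."""
    subtasks = []
    for line in plan_text.splitlines():
        match = _PLAN_LINE.match(line.strip())
        if match:
            subtasks.append(
                Subtask(int(match.group(1)), match.group(2), int(match.group(3)))
            )
    return subtasks


# Made-up plan text in the format the prompt asks for.
sample_plan = (
    "Paragraph 1 – Main Point: Introduce the founding of Rome – Word Count: 400 words\n"
    "Paragraph 2 – Main Point: Describe the rise of the Republic – Word Count: 1000 words\n"
)
plan = parse_plan(sample_plan)
total_words = sum(task.word_count for task in plan)
```

Summing the per-paragraph word counts, as in `total_words`, gives a quick check that the plan roughly adds up to the length the user asked for.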
Step II: Write
After obtaining the writing plan from Step I, LongWriter calls the LLM serially to complete each subtask, generating the writing content section by section. To ensure the coherence of the output, when LongWriter calls the model to generate the n-th section, the previously generated n−1 sections are also input, allowing the model to continue writing the next section based on the existing writing history. Although this serial manner prevents parallel calls to the model to complete multiple subtasks simultaneously, and the input length becomes longer, LongWriter shows in validation that the overall coherence and quality of the writing obtained this way are far superior to the output generated in parallel. The prompt in use by LongWriter is:
“You are an excellent writing assistant. I will give you an original writing instruction and my planned writing steps. I will also provide you with the text I have already written. Please help me continue writing the next paragraph based on the writing instruction, writing steps, and the already written text.
Writing instruction: User Instruction
Writing steps: The writing plan generated in Step I
Already written text: Previous generated (n-1) paragraphs
Please integrate the original writing instruction, writing steps, and the already written text, and now continue writing The plan for the n-th paragraph, i.e., the n-th line in the writing plan.”
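The serial write step described above can be sketched as a simple loop. This is an illustrative Python sketch under stated assumptions, not the authors' implementation: `agent_write` is a hypothetical function name, the prompt is an abbreviated paraphrase of the one quoted above, and `llm` stands for any callable that maps a prompt string to generated text (here replaced by a stub so the example runs without a model).

```python
from typing import Callable


def agent_write(
    instruction: str,
    plan_lines: list[str],
    llm: Callable[[str], str],
) -> str:
    """Generate one section per plan line, feeding earlier sections back in."""
    written: list[str] = []
    plan_text = "\n".join(plan_lines)
    for step in plan_lines:
        # The n-th call sees the n-1 sections written so far.
        history = "\n\n".join(written)
        prompt = (
            f"Writing instruction: {instruction}\n"
            f"Writing steps:\n{plan_text}\n"
            f"Already written text:\n{history}\n"
            f"Now continue writing: {step}"
        )
        written.append(llm(prompt))
    # The final long output is the concatenation of all sections.
    return "\n\n".join(written)


# Stub model for demonstration: records each prompt and returns a fixed string.
calls: list[str] = []


def fake_llm(prompt: str) -> str:
    calls.append(prompt)
    return f"Section {len(calls)} text."


article = agent_write(
    "Write about Rome",
    ["Paragraph 1 - intro", "Paragraph 2 - republic"],
    fake_llm,
)
```

Because each prompt includes the full history, the input grows with every call, which is the cost the article mentions in exchange for better coherence.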
Validation
LongWriter tests the generation length and quality of the proposed AgentWrite method on two long-form writing datasets. The first one, LongWrite-Ruler, is used to measure exactly how long of an output the method can provide. The second, LongBench-Write, is mainly used to evaluate how well the model-generated content aligns with user instructions in terms of length and writing quality.
LongBench-Write: To evaluate the model’s performance on a more diverse range of long-form writing instructions, LongWriter collects 120 varied user writing prompts, with 60 in Chinese and 60 in English. To better assess whether the model’s output length meets user requirements, LongWriter ensures that all these instructions include explicit word count requirements. These instructions are divided into four subsets based on the word count requirements: 0-500 words, 500-2,000 words, 2,000-4,000 words, and over 4,000 words. Additionally, the instructions are categorized into seven types based on the output type: Literature and Creative Writing, Academic and Monograph, Popular Science, Functional Writing, News Report, Community Forum, and Education and Training.
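As a small illustration of the four length subsets, the sketch below assigns an instruction to a subset from its required word count. The function name `length_bucket` is hypothetical, and treating the ranges as half-open intervals (so that exactly 2,000 words falls in the 2,000-4,000 subset) is an assumption, chosen to match the [2k, 4k) notation used in the results later in the article.

```python
def length_bucket(required_words: int) -> str:
    """Map a required word count to one of the four LongBench-Write subsets.

    Boundaries are treated as half-open intervals, which is an assumption.
    """
    if required_words < 0:
        raise ValueError("required_words must be non-negative")
    if required_words < 500:
        return "0-500"
    if required_words < 2000:
        return "500-2,000"
    if required_words < 4000:
        return "2,000-4,000"
    return "over 4,000"


buckets = [length_bucket(n) for n in (300, 500, 2000, 10000)]
```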
During evaluation, LongWriter adopts two metrics: one for scoring the output length and another for scoring the output quality. The model’s output length is scored based on how close it is to the requirements specified in the instructions. For output quality, LongWriter uses the LLM-as-a-judge approach, selecting the state-of-the-art GPT-4o model to score the output across six dimensions: Relevance, Accuracy, Coherence, Clarity, Breadth and Depth, and Reading Experience. The final score is computed by averaging the length score and the quality score.
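The two-metric scoring can be sketched as follows. The article does not give the length-score formula, so the piecewise-linear form below is an assumption: it is chosen so that a perfect length match scores 100 and an output at or below one-third of the required length scores 0, which agrees with the description of a zero score in the results section. The divisor used for over-long outputs and the function names `length_score` and `final_score` are likewise assumptions.

```python
def length_score(required: int, actual: int) -> float:
    """Score in [0, 100] for how closely the output length matches the request.

    Assumed piecewise-linear form; not quoted from the article.
    """
    if required <= 0:
        raise ValueError("required must be positive")
    if actual <= 0:
        return 0.0
    if actual > required:
        # Over-long outputs are penalised more gently (assumed divisor 3).
        penalty = (actual / required - 1) / 3
    else:
        # Short outputs reach 0 at one-third of the required length.
        penalty = (required / actual - 1) / 2
    return 100.0 * max(0.0, 1.0 - penalty)


def final_score(s_l: float, s_q: float) -> float:
    """Average the length score and the quality score, as the article describes."""
    return (s_l + s_q) / 2


exact = length_score(2000, 2000)      # perfect match
half = length_score(2000, 1000)       # half the requested length
third = length_score(3000, 1000)      # one-third of the requested length
overall = final_score(half, 80.0)     # 80.0 is a made-up quality score
```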
Validation results: On LongWrite-Ruler, LongWriter measures output length and finds that AgentWrite extends the maximum output length of GPT-4o from about 2k words to approximately 20k words. On LongBench-Write, LongWriter assesses both output quality and adherence to the required output length while evaluating AgentWrite’s performance, and shows that GPT-4o can successfully complete tasks with outputs under 2,000 words in length.
Supervised Fine-Tuning
LongWriter conducts training based on two of the latest open-source models, namely GLM-4-9B and Llama-3.1-8B. Both are base models that support a context window of up to 128k tokens, making them naturally suitable for training on long outputs. To make the training more efficient, LongWriter adopts packing training with loss weighting. Training yields two models: LongWriter-9B (short for GLM-4-9B-LongWriter) and LongWriter-8B (short for Llama-3.1-8B-LongWriter).
At the same time, LongWriter notices that if the loss is averaged by sequence, i.e., taking the mean of each sequence’s average loss within a batch, the contribution of each target token to the loss in long output data would be significantly less than those with shorter outputs. In LongWriter’s experiments, it is also found that this leads to suboptimal model performance on tasks with long outputs. Therefore, LongWriter chooses a loss weighting strategy that averages the loss by token, where the loss is computed as the mean of losses across all target tokens within that batch.
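The difference between the two averaging schemes is easy to see with a toy batch. The sketch below uses plain Python lists of made-up per-token losses rather than a real training framework; the function names are hypothetical.

```python
def sequence_mean_loss(batch: list[list[float]]) -> float:
    """Average each sequence's loss first, then average across sequences."""
    per_sequence = [sum(seq) / len(seq) for seq in batch]
    return sum(per_sequence) / len(per_sequence)


def token_mean_loss(batch: list[list[float]]) -> float:
    """Average over all target tokens in the batch, regardless of sequence."""
    total = sum(sum(seq) for seq in batch)
    count = sum(len(seq) for seq in batch)
    return total / count


# One short output (10 tokens, loss 1.0 each) and one long output
# (1,000 tokens, loss 2.0 each). The values are invented for illustration.
short_output = [1.0] * 10
long_output = [2.0] * 1000
batch = [short_output, long_output]

by_sequence = sequence_mean_loss(batch)  # each long-output token weighs 1/2000
by_token = token_mean_loss(batch)        # every token weighs 1/1010
```

Under sequence averaging, each token of the long output carries one hundredth of the weight of a token in the short output, so the long output's higher loss is diluted. Token averaging gives every target token equal weight, which is the strategy the article says LongWriter chooses.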
All models are trained using a node with 8xH800 80G GPUs and DeepSpeed+ZeRO3+CPU offloading. LongWriter uses a batch size of 8, a learning rate of 1e-5, and a packing length of 32k. The models are trained for 4 epochs, which takes approximately 2,500-3,000 steps.
Alignment (DPO)
To further improve the model’s output quality and enhance its ability to follow length constraints in instructions, LongWriter performs direct preference optimization (DPO) on the supervised fine-tuned LongWriter-9B model. The DPO data comes from GLM-4’s chat DPO data (approximately 50k entries). Additionally, LongWriter constructs 4k pairs of data specifically targeting long-form writing instructions. For each writing instruction, LongWriter samples 4 outputs from LongWriter-9B and scores them following a specific method, with a length-following score also computed and combined into the result. The highest-scoring output is then selected as the positive sample, and one of the remaining three outputs is randomly chosen as the negative sample.
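The pair-selection rule (best-scoring output as the positive sample, a random one of the rest as the negative sample) can be sketched as below. This is a hedged illustration rather than LongWriter's code: `build_preference_pair` is a hypothetical helper, the scores are made up, and the `prompt`/`chosen`/`rejected` keys are an assumed record layout, not something the article specifies.

```python
import random


def build_preference_pair(
    prompt: str,
    outputs: list[str],
    scores: list[float],
    rng: random.Random,
) -> dict[str, str]:
    """Pick the highest-scoring output as chosen and a random other as rejected."""
    if len(outputs) != len(scores) or len(outputs) < 2:
        raise ValueError("need at least two outputs, each with a score")
    best = max(range(len(outputs)), key=lambda i: scores[i])
    others = [i for i in range(len(outputs)) if i != best]
    rejected = rng.choice(others)
    return {
        "prompt": prompt,
        "chosen": outputs[best],
        "rejected": outputs[rejected],
    }


# Four sampled outputs with invented combined scores.
rng = random.Random(0)  # seeded only to make the example repeatable
pair = build_preference_pair(
    "Write a 5,000-word essay on aqueducts.",
    ["draft A", "draft B", "draft C", "draft D"],
    [3.0, 9.0, 5.0, 1.0],
    rng,
)
```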
The resulting model, LongWriter-9B-DPO, is trained for 250 steps on the above data mixture. LongWriter follows a specific recipe for DPO training.
LongWriter: Experiments and Results
LongWriter evaluates 4 proprietary models and 5 open-source models on LongBench-Write, along with the trained LongWriter models. To the best of LongWriter’s knowledge, Suri-IORPO is the only prior model that is also aligned for long-form text generation. It is trained based on Mistral-7B-Instruct-v0.2 using LoRA. Consistent with the evaluation setup on LongWrite-Ruler, LongWriter sets the output temperature to 0.5 and configures the model’s generation max tokens parameter to the maximum allowed by its API call. For open-source models, it is set to 32,768.
Most previous models are unable to meet the length requirement of over 2,000 words, while LongWriter models consistently provide longer and richer responses to such prompts. 
Observing the output length score S_l for prompts in each required length range, LongWriter finds that previous models generally perform poorly (scoring below 70) on prompts in the [2k, 4k) range, with only Claude 3.5 Sonnet achieving a decent score. For prompts in the [4k, 20k) range, almost all previous models are completely unable to reach the target output length, even scoring 0 (meaning all output lengths are less than one-third of the required length). By adding training data from LongWriter-6k, LongWriter’s trained model can effectively reach the required output length while maintaining good quality, as suggested by the scores in the [2k, 20k) range and the scatter plots.
DPO effectively improves both the model’s output quality and its ability to follow length requirements in long generation. 
By comparing the scores of LongWriter-9B and LongWriter-9B-DPO, we find that DPO significantly improves both the S_l (+4%) and S_q (+3%) scores, and the improvement is consistent across all ranges. This shows that in long-generation scenarios, DPO still helps to improve the model’s output quality and can better align the model’s output length with the requested length. The latter conclusion has also been recently observed in Yuan et al. (2024) for shorter generations. We also manually annotate pairwise wins and losses for GPT-4o and three LongWriter models on their outputs in LongBench-Write and visualize the results in Figure 9. Humans prefer the DPO-trained model over LongWriter-9B in 58% of the cases. Moreover, despite having fewer parameters, LongWriter-9B-DPO achieves a tie with GPT-4o.
[Figure 7: Cumulative average NLL loss of GLM4-9B and Llama-3.1-8B at different positions of LongWriter models’ outputs.]
[Figure 8: LongWrite-Ruler test results of LongWriter models, showing their maximum generation lengths between 10k-20k words.]
The output length limit of the LongWriter models is extended to between 10k and 20k words, while more data with long outputs is required to support even longer outputs. 
Following the LongWrite-Ruler test, we also present the LongWrite-Ruler test results of the LongWriter models. The results suggest that their maximum generation lengths are between 10k and 20k words. The lack of SFT data with longer outputs is likely the primary reason preventing the model from achieving longer output lengths.
Final Thoughts
In this article, we have discussed LongWriter, a framework that identifies a 2,000-word generation limit for current LLMs and proposes increasing their output window size by adding long-output data during alignment. To automatically construct long-output data, LongWriter develops AgentWrite, an agent-based pipeline that decomposes ultra-long generation tasks into subtasks and uses off-the-shelf LLMs to create extended, coherent outputs. LongWriter successfully scales the output window size of current LLMs to over 10,000 words with the constructed LongWriter-6k. Extensive ablation studies on the training data demonstrate the effectiveness of this approach. For future work, LongWriter suggests the following three directions: 1. Expand the AgentWrite framework to construct data with longer outputs to further extend LLMs’ output window size. 2. Refine the AgentWrite framework to achieve higher-quality long-output data. 3. Address the inference-efficiency challenges that longer model outputs bring: several methods have been proposed to improve inference efficiency, and it is worth investigating how they can do so without compromising generation quality.
0 notes
turtleblogatlast · 10 months ago
Text
Something I like about Leo is that he’s honestly really chill? It’s easy to remember the moments where he’s being obnoxious or excitable but I feel like most of the time he’s incredibly “go with the flow” and has an overall affable demeanor.
#rottmnt#rise of the teenage mutant ninja turtles#rottmnt leo#rise leo#Genuinely speaking I feel like said demeanor is incredibly useful for when he has to charm and/or persuade people into listening to him#I have a whole post talking about Leo’s charm and how he consistently gets people to hear him out even if he’s annoyed or upset them#like they’ll still listen to what he has to say in full#his charisma stat is real and utilized quite often in this series I swear he’s not just a loser cringeboy all the time 😭#if he wants to persuade and/or charm then he honestly sooo often does#me listing the 400th reason why Leo grows up to be the worlds best ninja and a good 365 of those reasons are Leo’s various subterfuge skill#Like most episodes where he’s not the main focus (and even many where he is)#he’s a voice of reason who notices things quickly and is often the one taking point to talk down situations#something interesting I found between Leo and Mikey is that#Mikey tells people what they need to hear#Leo tells people what they want to hear#not only out of his own agenda either#when bullhop was wrecking their home leo was the one that negotiated to make the situation go smoother#even if he would have rather bullhop left#meanwhile Mikey is the one who bluntly tells things as it is#small character moment that means a lot to me#Mikey is an honest boy who is upfront about his feelings#Leo prefers to let people make their own decisions he wants them to through steering the convo in that direction#but he is easily cowed by guilt#regardless leo is a people person - he knows how to talk to them and how to manipulate/persuade#and I like that his bros know this and often push him forward to do the talking if they wanna charm someone into doing what they want#I think Leo’s hope speeches are also an example of this - he’s saying what people really want to hear (and often it’s ALSO what they NEED)#the further the series goes on the higher Leo’s inner stress rises and he just keeps 
that chill aura anyway#there’s a reason!!! he wanted to go to a SPA so badly!!#literally the first thing he does when he gets in is rest#no joke meditation would do him good? like- it’s a Leo thing and I genuinely think rise leo would be no different here
359 notes · View notes
z0mbzii · 1 year ago
Text
t girl apollo I love you 2 the ends of the earth ♡
(Click 4 quality as per routine)
686 notes · View notes
blaithnne · 7 months ago
Text
I like how I gave Frida two flags when she only needed one, meanwhile David has to squish both flags together. To be fair, it’s just like Frida to be incredibly organised and bring multiple back up flags, and for David to panic and not do that.
400 notes · View notes
marypickfords · 1 year ago
Photo
Gentlemen Prefer Nature Girls (Doris Wishman, 1963)
831 notes · View notes
quietwingsinthesky · 3 days ago
Text
the thing is, i don’t disagree that what and how you choose to write about things is a reflection of you, the writer, and cannot be divorced from your beliefs and biases. but to assume that anyone can perfectly pick out your beliefs from a piece of fiction you’ve written is an extremely weird and paranoid thing to claim, especially if you’re positing that the secret truth is that the writer is a dangerous person and they’re “revealing” that through fiction that upsets you, personally. at the end of the day, the artist is not the art, and no matter how upsetting, no matter how much you dislike it, no matter how much it disgusts you, you don’t know the person who made it and you can only guess how who they are shaped what they made.
like, really, is it more likely that the secret “belief” that’s being revealed when you’re reading something that is really upsetting to you is that the author is a secret Bad Person™️ who wants all of this fantasy to happen in real life, or that the author knows it’s a safe and normal thing to create upsetting works of fiction and so did that.
61 notes · View notes
to-be-so-lonely-fl · 3 months ago
Text
I just logged onto my account after years, to this family we all built together for our mutual love for the boys.
I'm in absolute shock reading the news that Liam has sadly passed away. One direction was always so important to me, to us. It's a strange feeling knowing we will never hear him sing live again. We grew up with him, and watched him grow as a man. Now I may not agree with the lifestyle he chose after one direction but no one deserves to go the way he did. He had a son who will no longer have a dad, sisters who will no longer have a brother, parents who will no longer have a son.
This is heartbreaking news and I'm sending all my love to his family and friends at this awful time 💔❤️
❤️Rest in peace Liam , forever in our hearts ❤️
143 notes · View notes
qprpbj · 6 months ago
Text
would it be a hot take if i said i.. almost prefer the musicals take on how darry WAS in school and then had to drop out… or
139 notes · View notes
amorchai · 2 months ago
Text
𝐘𝐎𝐔𝐑 𝐂𝐀𝐌𝐄𝐑𝐀 𝐑𝐎𝐋𝐋 𝐀𝐒 𝐋𝐎𝐔𝐈𝐒' 𝐆𝐅.
louis tomlinson x girlfriend!reader.
amorchai masterlist . taglist
amorchai © ─ all rights reserved. no reposting/translating/copying will be tolerated.
70 notes · View notes